1  Introduction

Until recently, designers of parallel scientific programs have included little or no support for fault
tolerance in their applications. This attitude has been justified through the following observations:
(i) the modest size of existing multiprocessor hardware platforms has made failures relatively
rare events, (ii) programming fault-tolerant applications has meant mastering complex distributed
computing concepts, (iii) the overhead for fault tolerance has been judged to be too high with
respect to desired performance.

With the advent of massively-parallel machines with tens of thousands of processors and complex
interconnection networks (e.g., the Connection Machine [20], the J-Machine [18]), application-level
fault tolerance support has to be reconsidered. In machines of this size, hardware-based fault
tolerance, such as that employed in the Tandem [7] and Stratus [35] systems, is clearly impractical.
No matter how reliable the individual components are, the sheer size of these systems can result
in a significant probability of failure during lengthy computations. If parallel applications that use
large numbers of processors are to make progress, they have to anticipate the possibility of partial
failures and take appropriate steps to recover from them. Unless this is done, reliability will become
the limiting factor in the parallelism that can be achieved by applications [29].

Shared-memory multiprocessors present severe architectural problems when scaled to very large
dimensions. It is widely accepted that constructing parallel machines that can scale to very large
numbers of processors will be possible only for distributed-memory architectures. Physical properties
of these machines will prevent relying on a global clock as a time base, and partial failures
will result in loss of communication or computation without bringing down the entire system. In
other words, the loose coupling that is dictated by size will render these machines equivalent to
“distributed systems in a box.”

Extremely fast networking is yet another trend that supports this “distributed system” view
of parallel multiprocessors. With the possibility of gigabit-per-second communication over large
geographic distances [21], an entire network of machines (parallel or scalar) can be thought of as
a parallel multiprocessor. Existing efforts in the United States linking distant supercomputing
centers across the country with high-speed communication lines support this observation. Even on
a more modest scale, there are many efforts to support parallel computing over networks (Ethernet
LANs) of workstations [5, 8].

The realization that parallel multiprocessors are logically (or physically, as in the case of
network-based computing) equivalent to distributed systems has two consequences. First,
fault-tolerant parallel computing in distributed-memory multiprocessors has to be solved in the presence
of uncertainties that are inherent to distributed systems. Second, the wide body of knowledge that has
been accumulated for fault-tolerant distributed computing can be directly applied to fault-tolerant
parallel computing.

The remainder of this paper is organized as follows. In the next section we present a brief
survey of the major paradigms for fault-tolerant distributed computing. Transactions, checkpointing,
active replication and passive replication are examined and evaluated as possible mechanisms
for fault-tolerant parallel computing. Section 3 is a brief introduction to the ISIS distributed
programming toolkit that includes the necessary primitives for implementing a wide range of fault
tolerance mechanisms. Section 4 is an overview of the Paralex programming environment that
permits parallel applications to be developed and executed in distributed systems with automatic
support for fault tolerance. Use of passive replication to render parallel programs fault tolerant in
Paralex is discussed in Section 5. We conclude the paper with some observations derived from our
experience with Paralex.


2  Paradigms  for  Fault-Tolerant  Distributed  Computing 

Failures in a system can result in the loss of data or computation. In this paper we address
only the possibility of tolerating failed computations. While keeping data correct and available
is an equally important concern, it is beyond the scope of this paper.

All other things being equal, a distributed application will be less reliable than its centralized
equivalent: there are simply more components that the distributed application depends upon and
each can fail independently. For distributed systems to be useful, they have to be fault tolerant.

Tolerating failures in any system requires some form of redundancy. In time redundancy, the
failed computation is restarted on the same processor (once the cause of the failure has been
eliminated) or on another processor and repeated until it completes successfully. In space redundancy,
the computation is carried out on several physically independent processors in parallel and a vote
is taken to extract a single output from the (potentially different) results.

It is clear that space-redundant systems are more expensive in terms of computational resources.
In return, they are able to mask out failures and continue producing correct outputs with no loss in
performance¹. This makes space-redundant systems suitable for time-critical applications such as
process control. Time-redundant systems, on the other hand, go through a recovery phase where
no useful computation is being carried out. They also require the ability to detect failures before
(incorrect) results are communicated externally. Parallel scientific applications typically do not
have timing constraints critical enough to justify the cost of space redundancy. If, however, the parallelism
available in the hardware exceeds that usable by the application, the extra processors may be put
to good use by running replicas of the primary computation.

Achieving fault tolerance through redundancy in distributed systems requires that computations
on different processors cooperate. The lack of shared memory and the lack of a global clock
make reasoning about such systems a difficult task. Since message exchange is the only means
of communication and it incurs random delays, it is impossible for any one component to have an
instantaneous view of the global computation state. The possibility of processor and communication
failures further increases the level of uncertainty in these systems and adds to their conceptual
difficulty. We can hope to master fault-tolerant distributed computing only through the use of
appropriate paradigms that abstract away many of these complexities [30, 16]. In the following
sections we present some of these paradigms.

2.1  Transactions

Transactions were originally proposed as a software structuring mechanism for applications that
access shared data on secondary storage (typically a database) [19]. In this model, computations
are divided into units of work called transactions. The system guarantees three properties for
transactions: atomicity, serializability and permanence. Atomicity is with respect to failures in the
sense that the execution of a transaction is “all or nothing”: failures never leave intermediate
states of a transaction visible to other transactions. Serializability, on the other hand, requires that
the effect of concurrent execution of several transactions be equivalent to some serial execution (one
after the other in some arbitrary order). Permanence guarantees that computations make progress
despite failures since their results will never be undone.

¹ As we shall see, there is a slight degradation in performance due to the dissemination of inputs to the replicas.

Programming with transactions presents to the user an idealized world where failures and other
concurrent transactions have been abstracted away. The system automatically restarts transactions
if a failure interrupts their execution part way or if serializability cannot be guaranteed. Once a
transaction commits, it can be sure that the data values written are as if it executed in isolation
and without any failures. Thus, transactions transform the system from one consistent state to
another. By definition, transaction boundaries always define consistent system states from which
a computation can recover. The basic transaction model has been extended to distributed
systems [26].
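The all-or-nothing behaviour described above can be illustrated with a minimal sketch. It assumes a single in-memory store and a single thread of control, so serializability and permanence are not modelled; the names `Store`, `run_transaction` and `transfer` are hypothetical and introduced only for this illustration.

```python
import copy


class Store:
    """Toy in-memory data store standing in for the shared database."""

    def __init__(self, data):
        self.data = dict(data)


def run_transaction(store, body):
    """Apply `body` to a private copy and install the copy only on success."""
    working = copy.deepcopy(store.data)   # intermediate states stay private
    try:
        body(working)
    except Exception:
        return False                      # abort: the store is left untouched
    store.data = working                  # commit: all updates appear at once
    return True


def transfer(amount):
    """Build a transaction body that moves `amount` from account a to b."""
    def body(accounts):
        accounts["a"] -= amount
        if accounts["a"] < 0:
            raise ValueError("insufficient funds")   # failure part way through
        accounts["b"] += amount
    return body


store = Store({"a": 100, "b": 0})
ok = run_transaction(store, transfer(30))       # commits
failed = run_transaction(store, transfer(500))  # aborts, leaves no trace
```

After both calls the store reflects only the committed transfer; the aborted one never exposes its partial update.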

One of the drawbacks of the transactional model is that fault tolerance cannot be integrated
transparently to applications. Programs must explicitly use the transaction paradigm by announcing
the beginning and end of transactions within programs at opportune points. Furthermore,
while the serializability requirement may be appropriate for database applications, it can be overly
restrictive for parallel computations that do not access shared files. The overhead introduced by
the complex mechanisms that implement the transaction abstraction may be significant for most
parallel applications.

Modern systems that adopt the transaction model as the basis for fault-tolerant distributed
computing include Arjuna [32], Argus [27] and Camelot [33].

2.2  Checkpointing 

An  arbitrary  distributed  computation  could  be  made  fault  tolerant  without  having  to  structure  it 
as  a  collection  of  transactions.  All  that  is  required  is  a  mechanism  whereby  computations  can  be 
restarted  from  some  past  state  in  response  to  failures.  To  prevent  having  to  restart  computations 
always  from  the  very  beginning,  and  thus  guarantee  forward  progress,  the  state  of  the  failure-free 
execution is periodically saved to stable storage². The saved past states are called checkpoints.
Restoring  the  system  to  a  set  of  checkpoints  and  repeating  the  lost  computations  is  called  recovery. 
The  frequency  with  which  checkpoints  are  taken  is  a  system  tuning  parameter  and  establishes  the 
relative  costs  of  the  failure-free  execution  overhead  and  recovery  delays. 
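The trade-off between checkpoint frequency and recovery delay can be seen in a small sketch. It assumes a single sequential computation, a dictionary standing in for stable storage, and one injected failure; the function `run_with_checkpoints` and its parameters are hypothetical names used only here.

```python
import copy


def run_with_checkpoints(steps, state, interval, fail_at=None):
    """Run `steps` in order, saving the state every `interval` steps.

    `fail_at` injects a single failure just before that step index (an
    assumption made to exercise recovery). Returns the final state and the
    number of step executions, including repeated ones.
    """
    # Stand-in for stable storage: survives the injected failure.
    stable = {"next_step": 0, "state": copy.deepcopy(state)}
    executed = 0
    i = 0
    while i < len(steps):
        if fail_at is not None and i == fail_at:
            fail_at = None                          # fail only once
            i = stable["next_step"]                 # recovery: roll back
            state = copy.deepcopy(stable["state"])
            continue
        state = steps[i](state)
        executed += 1
        i += 1
        if i % interval == 0:                       # take a checkpoint
            stable = {"next_step": i, "state": copy.deepcopy(state)}
    return state, executed


steps = [lambda s: s + 1] * 10
rare = run_with_checkpoints(steps, 0, interval=4, fail_at=7)
frequent = run_with_checkpoints(steps, 0, interval=2, fail_at=7)
```

Both runs reach the same final state, but the run with the shorter interval repeats fewer steps after the failure, at the price of more checkpoint operations during failure-free execution.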

In a system where computations interact by exchanging messages, recovery of a failed computation
from an arbitrary set of checkpoints may result in an inconsistent global system state [14].
Intuitively, recovery should never be attempted from a system state in which some computation
appears to have received messages that have not yet been sent. The manner in which global system
state consistency is guaranteed results in two distinct strategies.
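The intuitive consistency condition above can be written as a short check. This is a sketch under assumed data structures: each checkpoint records the sets of message identifiers its process has sent and received, and a message identifier is a (sender, sequence number) pair. The function name `is_consistent` is hypothetical.

```python
def is_consistent(checkpoints):
    """Return True if no checkpoint records a message its sender has not sent.

    `checkpoints` maps a process name to a dict with two sets of message
    ids, "sent" and "received"; a message id is a (sender, seq) pair.
    """
    for cp in checkpoints.values():
        for sender, seq in cp["received"]:
            if (sender, seq) not in checkpoints[sender]["sent"]:
                return False      # received, yet "not sent" in this state
    return True


# q's checkpoint was taken after receiving a message that p's checkpoint
# does not yet show as sent: recovery from this set must be avoided.
bad = {
    "p": {"sent": set(), "received": set()},
    "q": {"sent": set(), "received": {("p", 1)}},
}

# A message sent but not yet received is acceptable: it is merely in transit.
in_transit = {
    "p": {"sent": {("p", 1)}, "received": set()},
    "q": {"sent": set(), "received": set()},
}
```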

2.2.1  Optimistic  Recovery 

The general strategy is to design algorithms with the guess that failures will not occur at inopportune
times. As a recovery strategy, this leads to establishing checkpoints without any coordination
among the components. However, the system must have collected sufficient information along the
way so that exactly those computations that have to recover do so in case failures occur. For
example, in the scheme proposed by Strom and Yemini [34], checkpointing and message logging
occur concurrently with computation and communication. Causality information is maintained
such that recovery will occur from a consistent global system state. In a variant of the scheme,
messages are logged in the nonvolatile memory of the sender rather than the receiver, resulting in
even further concurrency of stable storage writes with respect to computations [22]. An unfortunate
consequence of optimistic strategies is that recovery time is difficult to bound since, in addition to
the failed computation, an arbitrary number of others may need to recover.

² Stable storage is a memory device whose contents survive all failures short of disasters. It is typically implemented

2.2.2  Conservative  Recovery 

A reasonable alternative to the above strategy is to structure the checkpointing mechanism in
a manner such that the set of latest checkpoints is always guaranteed to represent a consistent
system state. To prevent computations from having to recover from arbitrarily old states, conservative
schemes synchronize checkpointing with computation and communication. This has the desirable
consequence that recovery is both simple and more predictable in the delays it introduces to the
system. The cost, obviously, is shifted from recovery to checkpointing.

One way to guarantee consistency of checkpoints is to force each computation to record its state
after every message send operation and before doing anything else. Recovery consists of the failed
computation rolling back to its most recent checkpoint. This simple mechanism can be extended to
cope with missing messages. Unfortunately, this naive solution is impractical since checkpointing
to a stable store after every send will introduce significant delays to the computation. We return
to this issue in Section 2.4 where we discuss passive replication.

Consistency of checkpoints can be guaranteed even when they are taken much less frequently.
Koo and Toueg present a distributed algorithm that guarantees the set of most recent checkpoints to
always represent a consistent state [23]. A unilateral checkpoint action forces the minimum number
of additional computations to checkpoint along with it. Recovery also involves the minimum number
of computations that are affected by the failure.

2.3  Active  Replication 

Given that a distributed system contains multiple processing elements with independent failure
modes, a distributed service can be made more reliable by performing it in parallel on several
processors. This simple idea contains numerous subtleties that have to be addressed before it can
be made effective.

If a collection of replicas is to be functionally equivalent to a single component, it must accept
the same input and produce the same output. Clients of the replicated service continue to interact
with it as if it were implemented as a single component. On the client side, code fragments intercept
the client request and distribute it to the replicas. On the service side, code fragments intercept
an incoming request and engage in communication with all of the replicas of the service to achieve
the input dissemination. Finally, the outputs must be coalesced into a single value. All of this
code to wrap around clients and services can be generated automatically using technology similar
to Remote Procedure Call (RPC) stub generation [12].

We begin with the problem of coalescing the output. If only benign failures³ are to be tolerated,
then the first output to be produced by some replica can be taken as the component output. To
tolerate up to k such failures, it clearly suffices to have k + 1 replicas. If failures can cause incorrect
results to be produced by the replicas⁴, then a majority vote will determine the output. This clearly
requires 2k + 1 replicas to tolerate up to k failures. It also requires a (reliable) component to act
as the voter.
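The two coalescing rules and their replica counts can be sketched as follows. The function names are hypothetical, outputs are assumed to be hashable values collected by a reliable voter, and a crashed replica is modelled as contributing no output at all.

```python
from collections import Counter


def replicas_needed(k, benign=True):
    """Replicas needed to tolerate up to k failures, per the counts above."""
    return k + 1 if benign else 2 * k + 1


def first_output(outputs):
    """Benign failures only: any output that arrives is correct."""
    if not outputs:
        raise RuntimeError("no replica produced an output")
    return outputs[0]


def majority_vote(outputs, n_replicas):
    """Incorrect outputs possible: accept a value backed by a strict majority."""
    if not outputs:
        raise RuntimeError("no replica produced an output")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= n_replicas // 2:
        raise RuntimeError("no majority among replica outputs")
    return value


# k = 2 faulty replicas out of 2k + 1 = 5 cannot outvote the 3 correct ones.
voted = majority_vote([4, 4, 9, 4, 7], n_replicas=replicas_needed(2, benign=False))
```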

Distributing the input to the replicas is even more subtle. For the above voting scheme to
work, all correct replicas must produce the same output. This requires that they all see the same
input and that the computations they perform be deterministic. The input must be disseminated
such that either all or none of the replicas see it. Protocols that achieve this in the presence of
failures are called reliable broadcast protocols [15, 3]. If the service interacts with multiple clients,
the replicas must not only see the same inputs, but also see them in the same order. Achieving
this in the presence of failures requires the use of an atomic broadcast protocol [17]. Depending
on the failure assumptions and the system model, achieving atomic broadcast may require 3k + 1
replicas to tolerate up to k failures [25]. Thus, this may be the dominant factor in determining the
replication level rather than simple majority.

The above ideas have been expounded in a general methodology called the state machine
approach for automatically adding fault tolerance to distributed services [31]. It is important
to note that while active replication can result in higher reliability of services in the short run,
these systems become less reliable than their non-replicated counterparts in the long run [2, 36].
To maintain reliability levels sufficiently high over long intervals, it must be possible to vary the
number of replicas dynamically: failed ones must be taken off line and new or repaired ones
brought on line. The difficulty in achieving this is maintaining a consistent view of the replica set
among the processors. This in turn requires a solution to the group membership problem [28].

2.4  Passive  Replication 

While active replication is able to mask failures without any recovery delays, it is costly, since all of the
replicas compute actively, consuming resources. Unless the system has an abundance of processors,
the approach may not be practical. Passive replication offers a more economical alternative. The
service is replicated just as before; however, only one of the replicas computes while the others
remain dormant. If the initial computation reaches completion, no further action is necessary. If
a failure prevents the first replica from completing, one of the dormant copies is activated and
resumes computing from where the failed replica left off. Thus, in the failure-free scenario, no computation is
wasted⁵.

Several observations are in order. First, the technique is effective only against benign failures,
since it is not possible to detect incorrect results. Second, there must be a failure detector so that a
passive replica may be started if the initial computation fails. Third, input to the replicas must be
disseminated atomically just as in active replication. Finally, the technique incurs a delay while the
newly-activated replica “catches up” with the failed computation by processing its input queue.

By far the most common realization of passive replication involves two copies, one known as
the primary and the other as the secondary backup [7, 13]. In this scheme, each communication
step requires atomically delivering the message to three destinations: the secondary of the sender
and the two copies of the destination. When a secondary takes over upon the failure of the primary,
it recovers by processing the messages in its input queue. Having seen the messages sent by the
primary before it failed serves to prevent the secondary from resending them during recovery. Only
when the secondary has reached the state of the primary before it failed does it engage in active
message sending.

⁴ These types of failures are sometimes called malicious or Byzantine.

⁵ There is a small overhead in keeping the replicas coordinated as discussed later.

While at first sight the primary-secondary replication scheme may seem very different from
checkpointing, the two are actually logically equivalent. Consider the checkpointing scheme with
conservative recovery where the computation is checkpointed after every send operation. If these
synchronous operations are to occur to stable storage, the delays would be intolerable. Rather than
representing a checkpoint as a process memory image on disk, we could choose to represent it as a
process state on another processor (the secondary) along with a count of sent messages. Replaying
the enqueued input messages at the secondary and discarding a number of output messages equal
to the primary count effectively restores the secondary state to that of the primary at the point
of the last send before the failure. The technique trades off delays in checkpointing (an atomic
three-way multicast rather than a write to stable store) with those of recovery (computation rather
than restoring the state from stable store). Given that failures are relatively rare in most systems,
the approach is very reasonable.
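Recovery by replay with a sent-message count can be sketched as below. The sketch assumes a deterministic computation expressed as a `step` function, a secondary that holds every input delivered to the primary, and a count of messages the primary is known to have sent; all names are hypothetical.

```python
def step(state, msg):
    """Deterministic computation: add the input and emit the running total."""
    state = state + msg
    return state, [state]


def recover(initial_state, input_queue, primary_sent_count):
    """Replay the enqueued inputs at the secondary.

    Outputs that the primary is known to have sent are suppressed; only the
    remaining ones are returned for actual sending.
    """
    state = initial_state
    to_send = []
    produced = 0
    for msg in input_queue:
        state, outputs = step(state, msg)
        for out in outputs:
            produced += 1
            if produced > primary_sent_count:   # not yet sent by the primary
                to_send.append(out)
    return state, to_send


# The primary had sent two messages before failing with three inputs queued.
state, resend = recover(0, [5, 7, 1], primary_sent_count=2)
```

Because the computation is deterministic, the replay regenerates exactly the outputs the primary produced, so counting is enough to decide which ones to discard.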

3  The  ISIS  Distributed  Programming  Toolkit 

From the above discussion, a relatively small number of abstractions have emerged as being
necessary for implementing a wide range of fault tolerance paradigms. Furthermore, we have seen
that replication plays a fundamental role in achieving fault tolerance. The ISIS toolkit has been
designed to facilitate easy construction of efficient distributed programs and to make them fault
tolerant [9, 11].

As we discussed in Section 2, the principal difficulty in reasoning about distributed systems is
the uncertainty due to communication and failures. Without the appropriate tools, a programmer
has to consider an extremely large number of possible executions when developing applications. For
example, a message broadcast to a group of processes by simple send operations may be received
by some and not received by others. Two concurrent broadcasts to the same set of processes may
be received in a different order by some of the members. Events corresponding to processes joining
or leaving (either voluntarily or due to a failure) a computation may be perceived by the members
in different order with respect to ongoing communication. The ISIS toolkit tries to put order to
this complex world. By using the appropriate communication primitives and relying on lower-level
support, many of the events in a distributed system can be made to appear as if they occurred
at the same instant in all components of a computation. The resulting system, called virtually
synchronous, offers tremendous intellectual economy to application developers [10].

ISIS runs on a large number of systems and extends the basic operating system primitives
with the following abstractions:

Process Groups  These are the principal structuring constructs for ISIS applications. A process
group is a named collection of processes. Process groups may overlap in arbitrary ways
to reflect the natural structure of the application. Group membership is dynamic in that
processes may join or leave at will. A built-in failure detector turns failures into group
departures of the appropriate processes. The group name may be used to address all current
members without having to know their individual identities. There are no restrictions on the
computation being carried out by the members of a group; they need not be replicas of the
same computation.

Group Communication  Applications in ISIS are structured as communicating process groups.
All data exchanged between groups are encoded as ISIS messages, providing a uniform
representation across heterogeneous architectures. ISIS protocols ensure that if a message broadcast
to a group is received by one of its members, it is received by all of its members, despite
benign processor and communication failures. With respect to ordering, ISIS provides three
alternatives:

FIFO Broadcast  Only broadcasts originating from the same source are received in the same
order by the process group members.

Causal Broadcast  Only broadcasts that are causally related are received in the same order.
Two broadcasts are said to be causally related if there exists a chain of communication
events such that one can affect the contents of the other [24]. Unrelated broadcasts may
be ordered arbitrarily. ISIS maintains the causality relation even across process group
boundaries.

Atomic Broadcast  All broadcasts to the group are received in the same order by all of its
members. This is true even for broadcasts that are causally unrelated. While the costs
of FIFO and Causal broadcasts are comparable, Atomic broadcast incurs a quantitative
increase in time delays.


State  Transfer  To  facilitate  coordination  among  group  members,  ISIS  provides  a  mechanism 
whereby  the  state  of  one  member  is  copied  to  another.  What  constitutes  the  process  state  is 
application  dependent  and  is  specified  by  the  programmer.  State  transfers  are  typically  used 
to  initialise  the  state  of  a  new  process  joining  a  group.  As  with  the  join  event  itself,  the  state 
transfer  is  ordered  consistently  by  all  group  members  with  respect  to  communication  events. 
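The causal ordering guarantee listed under Group Communication can be made concrete with a receiver-side sketch. It uses the textbook vector-timestamp delivery rule, which is an assumption chosen for illustration and is not claimed to be how ISIS implements causal broadcast; the class `CausalReceiver` and its fields are hypothetical.

```python
class CausalReceiver:
    """Delivers broadcasts in causal order at one group member."""

    def __init__(self, n_members):
        self.delivered = [0] * n_members   # broadcasts delivered, per sender
        self.pending = []                  # received but not yet deliverable
        self.log = []                      # payloads in delivery order

    def _deliverable(self, sender, stamp):
        # The broadcast must be the next one from `sender`, and every
        # broadcast it causally depends on must already be delivered here.
        return (stamp[sender] == self.delivered[sender] + 1 and
                all(stamp[j] <= self.delivered[j]
                    for j in range(len(stamp)) if j != sender))

    def receive(self, sender, stamp, payload):
        """Buffer an incoming broadcast and deliver whatever has become ready."""
        self.pending.append((sender, stamp, payload))
        progress = True
        while progress:
            progress = False
            for item in list(self.pending):
                s, ts, data = item
                if self._deliverable(s, ts):
                    self.pending.remove(item)
                    self.delivered[s] += 1
                    self.log.append(data)
                    progress = True


# Member 0 broadcasts m1; member 1 sees m1 and then broadcasts m2, so m2
# causally follows m1. Member 2 happens to receive them in the wrong order.
member2 = CausalReceiver(3)
member2.receive(1, [1, 1, 0], "m2")   # held back: depends on m1
member2.receive(0, [1, 0, 0], "m1")   # delivered, which releases m2
```

The same rule also yields FIFO order per sender, since a broadcast is deliverable only when it is the next one expected from its source.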


Given the above abstractions, it is possible to implement almost all of the paradigms of Section 2.
The lack of relevant concepts such as serializability and atomic commitment makes transactions
difficult to implement in ISIS.

Realizing active replication through process groups is immediate. Each computation to be
made fault tolerant is replicated to form a process group. All point-to-point communication is
replaced with atomic broadcasts to the relevant groups to achieve input dissemination. Since
clients may be replicated in addition to servers, each input request may be received multiple times
by the members of the server group. Some deterministic function (e.g., majority, mean, median)
will have to be applied to the copies of the input to select the value to use. This corresponds to the
output voting step of the active replication scheme. Even when the replication level is dynamic,
the input extraction function can be implemented by the replicas interrogating the current group
membership.

Process groups also form the basis for passive replication. Just before they start computing, all
members of a process group invoke the coordinator-cohort tool of ISIS which effectively selects
one member (the coordinator) to continue computing while the others (cohorts) remain inactive. If
ISIS detects the failure of the coordinator before its role comes to completion, it will nominate one
of the cohorts to the role of coordinator and resume its execution. While requests are disseminated
to group members using atomic broadcast as in active replication, there is no need for a voting
(input extraction) function since only one output will be produced (that of the coordinator).
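The coordinator-cohort pattern can be sketched independently of the ISIS interface, which is not reproduced here. The sketch assumes that the coordinator is the first surviving member in join order and that failures are reported by a perfect detector; both the selection rule and the function names are assumptions made for illustration.

```python
def coordinator(members, failed):
    """Pick the coordinator: the first member (in join order) still alive."""
    for m in members:
        if m not in failed:
            return m
    return None


def run_request(members, request, compute, crash_before_reply=()):
    """Every member holds `request`; only the coordinator computes and replies.

    `crash_before_reply` lists members that fail before finishing their
    role, which forces a cohort to take over the same request.
    """
    failed = set()
    while True:
        leader = coordinator(members, failed)
        if leader is None:
            raise RuntimeError("all group members failed")
        if leader in crash_before_reply:
            failed.add(leader)         # failure detected: a cohort takes over
            continue
        return leader, compute(request)


group = ["p1", "p2", "p3"]
normal = run_request(group, 6, lambda x: x * x)
takeover = run_request(group, 6, lambda x: x * x, crash_before_reply={"p1"})
```

Since exactly one member replies in either case, no voting step is needed on the output.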

Finally, the state transfer mechanism of ISIS provides a way to implement fault tolerance
through checkpointing. State transfers can be requested either to another process or to a disk file.
To the extent that a disk approximates stable storage, a failed computation can be resumed from
the most recent state found in the file.

4  Parallel  Computing  in  Distributed  Systems  with  Paralex 

Paralex is a programming environment for developing parallel applications and executing them on
a distributed system, typically a network of workstations. Programs are specified in a graphical
notation and Paralex automatically handles distribution, communication, data representation,
architectural heterogeneity and fault tolerance. It consists of four logical components: a graphics
editor for program specification, a compiler, an executor and a runtime support environment. These
components are integrated within a uniform graphical programming environment. Here we give a
brief overview of Paralex. Details can be found in [5].

The programming paradigm supported by Paralex is a restricted form of data flow [1]. A
Paralex program is composed of nodes and links. Nodes correspond to computations and the links
indicate the flow of (typed) data. Thus, Paralex programs can be thought of as directed graphs (and
indeed are visualized as such on the screen) representing the data flow relations plus a collection of
ordinary code fragments to indicate the computations. The current prototype limits the structure
of the data flow graph to be acyclic.

The semantics associated with this graphical syntax obeys the so-called “strict enabling rule”
of data-driven computations in the sense that when all of the links incident at a node contain
values, the computation associated with the node starts execution, transforming the input data to
an output. The computation to be performed by the node must satisfy the “functional” paradigm:
multiple inputs, only one output and no side effects. The actual specification of the computation
may be done using whatever appropriate notation is available, including standard sequential
programming languages, parallel programming notations (if the distributed system includes nodes
that are themselves multiprocessors), executable binary code or library functions for the relevant
architectures.
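The strict enabling rule can be illustrated with a minimal sketch. This is not Paralex code: the `Node` class, the `run` function and the single-process execution are assumptions made for illustration only. Each node is a pure function that fires exactly once, at the moment every one of its incoming links holds a value.

```python
from typing import Callable, Dict, List, Optional, Tuple


class Node:
    """A dataflow node wrapping a pure function (hypothetical structure)."""

    def __init__(self, name: str, func: Optional[Callable[..., object]], n_inputs: int):
        self.name = name
        self.func = func
        self.slots: List[object] = [None] * n_inputs   # one slot per incoming link
        self.filled: List[bool] = [False] * n_inputs
        self.targets: List[Tuple["Node", int]] = []    # (destination, input slot)

    def connect(self, dest: "Node", slot: int) -> None:
        """Draw a link from this node's single output to one input of dest."""
        self.targets.append((dest, slot))

    def enabled(self) -> bool:
        # Strict enabling rule: every incident link must contain a value.
        return all(self.filled)


def run(sources: Dict[Node, object]) -> Dict[str, object]:
    """Execute an acyclic graph; sources maps zero-input nodes to their values."""
    results: Dict[str, object] = {}
    ready: List[Tuple[Node, object]] = []
    for node, value in sources.items():
        results[node.name] = value
        ready.append((node, value))
    while ready:
        node, value = ready.pop()
        for dest, slot in node.targets:
            dest.slots[slot] = value
            dest.filled[slot] = True
            if dest.enabled():                 # last missing input just arrived
                out = dest.func(*dest.slots)   # multiple inputs, one output
                results[dest.name] = out
                ready.append((dest, out))
    return results


# Example graph: (a + b) feeds a squaring node.
a = Node("a", None, 0)
b = Node("b", None, 0)
total = Node("sum", lambda x, y: x + y, 2)
square = Node("square", lambda x: x * x, 1)
a.connect(total, 0)
b.connect(total, 1)
total.connect(square, 0)
results = run({a: 3, b: 4})
```

Because the graph is acyclic and each input slot has exactly one incoming link, a node becomes enabled only once, so no explicit "already fired" flag is needed in this sketch.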

Unlike classical data flow, the nodes of a Paralex program carry out significant computations.
This so-called large-grain data flow model [6] is a consequence of the properties of the underlying
distributed system where we seek to keep the communication overhead via a high-latency, low-
bandwidth network to reasonable levels.

There are many situations where the single output value produced by a node needs to be
communicated to multiple destinations as input so as to create parallel computation structures.
In Paralex, this is accomplished simply by drawing multiple output links originating from a node
towards the various destinations. To economize on network bandwidth, Paralex introduces the
notion of filter nodes that allow data values to be extracted on a per-destination basis before they
are transmitted to the next node. Conceptually, filters are defined and manipulated just as regular
nodes and their “computations” are specified through programs. In practice, however, all of the
data filtering computations are executed in the context of the single process that produced the data
rather than as separate processes to minimize the system overhead.
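The effect of filters can be sketched as follows. The function `fan_out`, the destination names and the matrix example are hypothetical and serve only to show the idea: the producer applies each destination's filter in its own process, so only the extracted portion of the output would cross the network.

```python
from typing import Callable, Dict, List

Matrix = List[List[int]]


def fan_out(value: object, filters: Dict[str, Callable[[object], object]]) -> Dict[str, object]:
    """Apply each destination's filter inside the producing process.

    Returns, per destination, the (smaller) value that would be transmitted.
    """
    return {dest: extract(value) for dest, extract in filters.items()}


# The producer's single output value.
matrix: Matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# One filter per outgoing link; each destination only needs one row.
filters = {
    "row_worker_0": lambda m: m[0],
    "row_worker_1": lambda m: m[1],
    "row_worker_2": lambda m: m[2],
}

outgoing = fan_out(matrix, filters)
```

Without filters, every destination would receive the full matrix; with them, each link carries a third of the data in this example.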

Once the user has fully specified the Paralex program by drawing the data flow graph and
supplying the computations to be carried out by the nodes, the program can be compiled. The first
pass of the Paralex compiler is actually a precompiler to generate all of the necessary stubs to wrap
around the node computations to achieve data representation independence, remote communication
and replica management for those nodes with fault tolerance needs. Type checking across links is
also performed in this phase. Currently, Paralex generates all of the stub code as ordinary C. As
the next step, the C compiler is invoked to turn each node into an executable module.

The Paralex compiler must also address the two aspects of heterogeneity: data representation
and instruction sets. Paralex uses the ISIS toolkit as the infrastructure to realize a universal
data representation. All data that is passed from one node to another during the computation
is encapsulated as ISIS messages. Heterogeneity with respect to instruction sets is handled by
invoking remote compilations on the machines of interest and storing multiple executables for the
nodes.
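The idea of a universal data representation can be illustrated independently of ISIS. The sketch below does not reproduce the ISIS message format; it is an assumed stand-in that uses Python's `struct` module with network byte order, so that a vector of doubles is encoded identically regardless of the sender's native byte order. The function names are hypothetical.

```python
import struct
from typing import List


def encode_doubles(values: List[float]) -> bytes:
    """Encode a vector as a 4-byte length followed by big-endian IEEE doubles."""
    header = struct.pack("!I", len(values))
    body = struct.pack(f"!{len(values)}d", *values)
    return header + body


def decode_doubles(buf: bytes) -> List[float]:
    """Inverse of encode_doubles; valid on any architecture."""
    (count,) = struct.unpack_from("!I", buf, 0)
    return list(struct.unpack_from(f"!{count}d", buf, 4))


message = encode_doubles([1.5, -2.0, 3.25])
restored = decode_doubles(message)
```

The `!` prefix selects network byte order with standard sizes and no padding, which is what makes the encoded bytes independent of the host machine.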

The  Paralex  executor  launches  the  parallel  computation  on  the  distributed  system  respecting 
all  architectural  constraints.  Details  of  how  Paralex  computation  graphs  are  mapped  onto  the 
hosts  of  a  distributed  system  and  how  the  execution  is  monitored  and  controlled  dynamically  are 
described  in  [4]. 

5  Replication  in  Paralex 

One of the primary characteristics that distinguishes a distributed system from a special-purpose
supercomputer is the possibility of partial failures during computations. These failures may be
due to real hardware faults or, more probably, to user actions such as rebooting
or turning off workstations. To render distributed systems suitable for long-running parallel
computations, automatic support for fault tolerance must be provided. The Paralex run-time system
contains the primitives necessary to support fault tolerance and dynamic load balancing.

As part of the program definition, Paralex permits the user to specify a fault tolerance level
for the computation graph. Paralex will generate all of the necessary code such that when a graph
with fault tolerance k is executed, each of its nodes will be executed on k + 1 distinct hosts to
guarantee success for the computation despite up to k failures. Failures that are tolerated are
of the benign type for processors (i.e., all processes running on the processor simply halt) and
communication components (i.e., messages may be lost). There is no attempt to guard against
more malicious processor failures nor against failures of non-replicated components such as the
network interconnect.
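The k + 1 rule can be made concrete with a small placement sketch. The round-robin policy and the name `place_replicas` are assumptions for illustration; the actual Paralex mapping of graphs onto hosts is described in [4] and is not reproduced here.

```python
from typing import Dict, List


def place_replicas(nodes: List[str], hosts: List[str], k: int) -> Dict[str, List[str]]:
    """Assign each node to k + 1 distinct hosts (hypothetical round-robin policy)."""
    if k < 0:
        raise ValueError("fault tolerance level must be non-negative")
    if k + 1 > len(hosts):
        raise ValueError("need at least k + 1 hosts to tolerate k failures")
    placement: Dict[str, List[str]] = {}
    for i, node in enumerate(nodes):
        # k + 1 consecutive hosts starting at offset i; distinct because
        # k + 1 never exceeds the number of hosts.
        placement[node] = [hosts[(i + j) % len(hosts)] for j in range(k + 1)]
    return placement


placement = place_replicas(["A", "B", "C"], ["h0", "h1", "h2", "h3"], k=2)
```

With k = 2, every node has three replicas, so any two host failures leave at least one replica of each node alive.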

Paralex uses passive replication as the basic fault tolerance technique. Given the application
domain (parallel scientific computing) and hardware platform (networks of workstations), Paralex
favors efficient use of computational resources over short recovery times in its choice of a fault
tolerance mechanism. Passive replication not only satisfies this objective, it provides a uniform
mechanism for dynamic load balancing through late binding of computations to hosts.

Paralex uses the ISIS coordinator-cohort toolkit to implement passive replication. Each node
of the computation that requires fault tolerance is instantiated as a process group consisting of
replicas for the node. One of the group members is called the coordinator in that it will actively
compute. The remaining members are cohorts and remain inactive other than receiving broadcasts
addressed to the group. When ISIS detects the failure of the coordinator, it automatically promotes
one of the cohorts to the role of coordinator.

Data flow from one node of a Paralex program to another results in a broadcast from the
coordinator at the source group to the destination process group. Only the coordinator of the
destination node will compute with the data value while the cohorts simply buffer it in an input
queue associated with the link. When the coordinator completes computing, it broadcasts the
results to the process groups at the next level and signals the cohorts (through another intra-group
broadcast) so that they can discard the buffered data item corresponding to the input for the
current invocation. Given that Paralex nodes implement pure functions and thus have no internal
state, recovery from a failure is trivial: the cohort that is nominated the new coordinator simply
starts computing with the data at the head of its input queues.
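The coordinator-cohort protocol described above can be simulated in a single process. The sketch below is not the ISIS API: `Replica`, `ProcessGroup` and the `crash_coordinator` flag are hypothetical names, broadcasts are modeled as plain loops, and a node with a single input link is assumed.

```python
from collections import deque
from typing import Callable, Deque, List


class Replica:
    """One member of a process group, with an input queue for the link."""

    def __init__(self, host: str):
        self.host = host
        self.alive = True
        self.queue: Deque[object] = deque()


class ProcessGroup:
    """Replicas of one single-input node; the first live member coordinates."""

    def __init__(self, func: Callable[[object], object], hosts: List[str]):
        self.func = func
        self.members = [Replica(h) for h in hosts]

    def live(self) -> List[Replica]:
        return [r for r in self.members if r.alive]

    def coordinator(self) -> Replica:
        live = self.live()
        if not live:
            raise RuntimeError("all replicas failed")
        return live[0]

    def deliver(self, value: object) -> None:
        # Group broadcast: every live replica buffers the input value.
        for replica in self.live():
            replica.queue.append(value)

    def step(self, crash_coordinator: bool = False) -> object:
        """Compute on the head of the queue, optionally simulating a crash."""
        if crash_coordinator:
            self.coordinator().alive = False   # halts before producing a result
        coord = self.coordinator()             # promotion of the next live cohort
        result = self.func(coord.queue[0])     # pure function: no state to restore
        # Intra-group broadcast: the input was consumed, so discard it everywhere.
        for replica in self.live():
            replica.queue.popleft()
        return result


group = ProcessGroup(lambda x: x * 2, ["h1", "h2", "h3"])
group.deliver(21)
output = group.step(crash_coordinator=True)
```

Because the cohorts buffered the same input, the promoted replica can recompute the result from scratch without any state transfer from the failed coordinator.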


Figure 1: Replication and Group Communication for Fault Tolerance.

Figure 1 illustrates some of these issues by considering a 3-node computation graph shown at
the top as an example. The lower part of the figure shows the process group representation of
the nodes based on a fault tolerance specification of 2. Arrows indicate message arrivals with time
running down vertically. The gray process in each group denotes the current coordinator. Note
that in the case of node A, the initial coordinator fails during its computation (indicated by the
X). The process group is reformed and the right replica takes over as coordinator. At the end of its
execution, the coordinator performs two broadcasts. The first serves to communicate the results
of the computation to the process group implementing node C and the second is an internal group
broadcast. The cohorts use the message of this internal broadcast to conclude that the current
buffered input will not be needed since the coordinator successfully computed with it. Note that
there is a small chance the coordinator will fail after broadcasting the results to the next node but
before having informed the cohorts. The result of this scenario would be multiple executions of
a node with the same (logical) input. This is easily prevented by tagging each message with an
iteration number and ignoring any input messages with duplicate iteration numbers.
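Duplicate suppression by iteration number can be sketched in a few lines. The `InputLink` class is a hypothetical illustration rather than Paralex code: it remembers which iteration numbers have been seen on a link and drops any message that repeats one.

```python
from typing import List, Set


class InputLink:
    """Receiving end of a link that ignores duplicate iteration numbers."""

    def __init__(self) -> None:
        self.seen: Set[int] = set()
        self.accepted: List[object] = []

    def receive(self, iteration: int, value: object) -> bool:
        """Accept the message unless its iteration number was already seen."""
        if iteration in self.seen:
            return False          # re-sent by a newly promoted coordinator
        self.seen.add(iteration)
        self.accepted.append(value)
        return True


link = InputLink()
first = link.receive(0, "result-0")
# The old coordinator failed after sending result-0 but before informing its
# cohorts, so the new coordinator recomputes and sends the same iteration again.
duplicate = link.receive(0, "result-0")
second = link.receive(1, "result-1")
```

Since nodes are pure functions, the recomputed value is identical to the original, so dropping the second copy loses no information.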

The execution depicted in Figure 1 may appear deceptively simple and orderly. In a distributed
system, other executions with inopportune node failures, message losses and event orderings may be
equally possible. What simplifies the Paralex run-time system immensely is structuring it on top of
ISIS, which guarantees “virtual synchrony” with respect to message delivery and other asynchronous
events such as failures and group membership changes. Paralex cooperates with ISIS toward this
goal by using a reliable broadcast communication primitive that respects causality [24].

6  Conclusions 

We have argued that current large-scale parallel multiprocessors have properties not unlike
distributed systems. With expected increases in the scale of parallel machines and increases in
network bandwidth of distributed systems, the distinction between them is rapidly fading. This leads
us to conclude that future parallel applications will have to confront fault tolerance just as current
distributed systems have to. Furthermore, the same tools and techniques used to render distributed
systems fault tolerant can be effectively used to render parallel applications fault tolerant.

Of the various paradigms developed for fault-tolerant distributed computing, passive replication
and checkpointing are probably the most appropriate for parallel computing. In fact, we have seen
that passive replication can be viewed as a special case of checkpointing. Modern distributed
programming toolkits include the necessary technologies for implementing a wide spectrum of fault
tolerance techniques.

While technologies such as ISIS are sufficient for fault-tolerant parallel computing, they still
require extensive distributed computing expertise to program with. Higher-level interfaces are
required if fault tolerance is to be used widely in parallel applications. Paralex represents one
such interface. By carefully selecting the programming constructs, fault tolerance can be added
automatically to parallel applications. Paralex is proof that this can be accomplished without
unreasonable penalties in performance.